By Eric Fan, December 2021
International news is often local: news outlets curate their international reporting to the tastes of their domestic audiences, and international news competes against local news, so it is more cost-effective when domestically relevant. Drawing inspiration from Professor Stray's blog post How Many World Wide Webs Are There?, I use tf-idf, SVD dimensionality reduction, and k-means clustering to explore which news topics are covered and how information flows across American and Chinese news outlets when they report on each other's country. Key questions to consider:
Using NewsAPI, I collected headlines and short descriptions of:
This is a class project for Frontiers of Computational Journalism with Professor Jonathan Stray at Columbia University.
If you want to skip the details, here are my main findings:
# standard
import re
import json
import pprint
import pandas as pd
from pandas import NamedAgg
import numpy as np
# translation
import googletrans
# nlp for data cleaning
import spacy
# sklearn stuff
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
# visualizations
import altair as alt
import matplotlib.pyplot as plt
from wordcloud import WordCloud
Data collected from NewsAPI was messy and required substantial cleaning. It arrived as lists of dictionaries, each dictionary holding the data for one article. In this section, I walk through each cleaning step I took. First, I created a data_cleaner method that removes junk text from headlines and descriptions:
# data cleaner helper method
remove_list = ['<table>', '</table>', '<tr>', '<td>', '</td>', '</tr>',
               '(左)', '(右)', '(中)', '<ol>', '<li>', '</ol>', '</li>']

def data_cleaner(dictionary, trash_description_triggers, title_suffix_triggers):
    for article in dictionary:
        # remove everything in remove_list
        for junk in remove_list:
            article['title'] = article['title'].replace(junk, '')
            article['description'] = article['description'].replace(junk, '')
        # blank out descriptions that are just outlet boilerplate
        for trigger in trash_description_triggers:
            if re.findall(trigger, article['description']):
                article['description'] = ""
        # strip outlet-name suffixes from titles
        for trigger in title_suffix_triggers:
            title_suffix_match = re.match(rf"(.*){trigger}", article['title'])
            if title_suffix_match:
                article['title'] = title_suffix_match.group(1)
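To show the effect of these cleaning steps, here is a minimal, self-contained illustration of the same logic applied to one fabricated article (the outlet name "Example Daily" and its trigger strings are made up for this sketch):

```python
import re

# fabricated example article for illustration only
article = {
    'title': 'Trade talks resume <td>- Example Daily',
    'description': 'Read the full story at Example Daily',
}

remove_list = ['<table>', '</table>', '<td>', '</td>']
trash_description_triggers = ['Example Daily']
title_suffix_triggers = ['- Example Daily']

# remove junk markup
for junk in remove_list:
    article['title'] = article['title'].replace(junk, '')
    article['description'] = article['description'].replace(junk, '')
# blank out boilerplate descriptions
for trigger in trash_description_triggers:
    if re.findall(trigger, article['description']):
        article['description'] = ""
# strip outlet-name suffixes from titles
for trigger in title_suffix_triggers:
    m = re.match(rf"(.*){trigger}", article['title'])
    if m:
        article['title'] = m.group(1)

print(article)  # {'title': 'Trade talks resume ', 'description': ''}
```

The markup is stripped, the boilerplate description is blanked, and the outlet suffix is cut off the headline.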
Second, I created a pair of extract_text methods that extract the headlines and descriptions from each article. For Chinese language articles, I translated the text into English using the googletrans library (an unofficial, free client for Google Translate).
# helper methods to extract (and, for Chinese, translate) article titles and descriptions
translator = googletrans.Translator()

def translateCnToEng(TextChinese):
    translation = translator.translate(TextChinese, src='zh-CN', dest='en')
    # pad punctuation with a trailing space so downstream tokenization splits cleanly
    result = re.sub('([.,!?():])', r'\1 ', translation.text)
    return result

def extract_text_cn(dictionary):
    return [translateCnToEng(article['title'] + " " + article['description']) for article in dictionary]

def extract_text_en(dictionary):
    return [article['title'] + " " + article['description'] for article in dictionary]
Third, I used the spacy package to tokenize the text and do some basic NLP, such as removing stop words and punctuation. Here, I also removed tokens like "united states" and "china" to help with further analysis, given that these words are expected to be common across all of our articles.
nlp = spacy.load('en_core_web_sm')
wordsToRemove = ['pron', '', ' ', 'united', 'states', 'america', 'american',
                 'china', 'chinese', 'td', 'tr', 'ol', 'li']

def extractWords(data):
    results = []
    for text in data:
        doc = nlp(text.lower())
        bow = [
            token.lemma_ for token in doc
            if (token.lemma_ not in wordsToRemove) and not (token.is_stop or token.is_punct)
        ]
        results.append(bow)
    return results
Finally, I created two master methods to pull everything together, one for Chinese language articles and one for English language ones. I will use these two methods to process all my data before clustering.
def generate_bow_cn(input_path, original_list, words_list, source, date,
                    trash_description_triggers, title_suffix_triggers):
    inp = json.load(open(input_path, "r", encoding="utf8"))['articles']
    data_cleaner(inp, trash_description_triggers, title_suffix_triggers)
    original_list = original_list + [article['title'] + " " + article['description'] for article in inp]
    source = source + [article['source']['name'] for article in inp]
    date = date + [article['publishedAt'] for article in inp]
    extractedText = extract_text_cn(inp)
    words_list = words_list + extractWords(extractedText)
    return original_list, words_list, source, date

def generate_bow_en(input_path, original_list, words_list, source, date,
                    trash_description_triggers, title_suffix_triggers):
    inp = json.load(open(input_path, "r", encoding="utf8"))['articles']
    data_cleaner(inp, trash_description_triggers, title_suffix_triggers)
    extractedText = extract_text_en(inp)
    original_list = original_list + extractedText
    source = source + [article['source']['name'] for article in inp]
    date = date + [article['publishedAt'] for article in inp]
    words_list = words_list + extractWords(extractedText)
    return original_list, words_list, source, date
Now that I have all the helper methods I need, let's start analyzing the data, beginning with the Chinese language articles.
origial_cn = []
words_cn = []
source = []
date = []
origial_cn, words_cn, source, date = generate_bow_cn("data/chinanews_20211004_20211020.json", origial_cn, words_cn, source, date, ['中国新闻网'], ['中新网', '中国新闻网'])
origial_cn, words_cn, source, date = generate_bow_cn("data/chinanews_20211021_20211104.json", origial_cn, words_cn, source, date, ['中国新闻网'], ['中新网', '中国新闻网'])
origial_cn, words_cn, source, date = generate_bow_cn("data/huanqiu_20211004_20211020.json", origial_cn, words_cn, source, date, ['环球网'], ['环球网'])
origial_cn, words_cn, source, date = generate_bow_cn("data/huanqiu_20211021_20211104.json", origial_cn, words_cn, source, date, ['环球网'], ['环球网'])
origial_cn, words_cn, source, date = generate_bow_cn("data/people_20211004_20211020.json", origial_cn, words_cn, source, date, [], [])
origial_cn, words_cn, source, date = generate_bow_cn("data/people_20211021_20211104.json", origial_cn, words_cn, source, date, [], [])
origial_cn, words_cn, source, date = generate_bow_cn("data/nytimesCN_20211004_20211104.json", origial_cn, words_cn, source, date, ['纽约时报中文网'], ['纽约时报中文网'])
origial_cn, words_cn, source, date = generate_bow_cn("data/wsjCN_20211004_20211104.json", origial_cn, words_cn, source, date, ['华尔街日报中文网'], ['华尔街日报中文网'])
words_cn is a list of lists of words (tokens), but the tf-idf vectorizer takes a list of strings, so I join each article's tokens into one long string before passing it to the vectorizer.
words_cn_joined = [" ".join(article) for article in words_cn]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(words_cn_joined)
X
<476x3516 sparse matrix of type '<class 'numpy.float64'>' with 12430 stored elements in Compressed Sparse Row format>
Now I have a matrix of 476 articles by 3,516 words. It's sparse because any single article uses only a small fraction of the vocabulary. The next step is k-means clustering; here I use the Elbow Method to find an optimal k.
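To make these tf-idf values concrete, here is a self-contained sketch (on a toy two-document corpus of my own) that reproduces what sklearn's TfidfVectorizer does by default: a smoothed idf, ln((1+n)/(1+df)) + 1, followed by l2 normalization of each row:

```python
import math

docs = [["cat", "dog"], ["cat"]]          # toy tokenized corpus
vocab = ["cat", "dog"]
n = len(docs)

# smoothed idf, matching sklearn's TfidfVectorizer defaults: ln((1+n)/(1+df)) + 1
df = {w: sum(w in d for d in docs) for w in vocab}
idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}

rows = []
for d in docs:
    raw = [d.count(w) * idf[w] for w in vocab]   # tf * idf
    norm = math.sqrt(sum(x * x for x in raw))    # l2 normalization
    rows.append([x / norm for x in raw])

print(rows[1])  # [1.0, 0.0] -- the second doc contains only "cat"
```

A word that appears in every document ("cat") gets the minimum idf of 1, while rarer words ("dog") are weighted up, which is why distinctive topic words dominate the clustering.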
Sum_of_squared_distances = []
K = range(2, 15)
for k in K:
    km = KMeans(n_clusters=k, max_iter=200, n_init=10, random_state=2021)
    km = km.fit(X)
    Sum_of_squared_distances.append(km.inertia_)

plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
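Reading the elbow off the plot is ultimately a judgment call. As a sketch of one possible programmatic shortcut (my own heuristic, not part of the original analysis), one can rank candidate k values by the relative drop in inertia that precedes them:

```python
# hypothetical helper: rank k values by how sharply inertia dropped to reach them
def elbow_candidates(ks, inertias, top=2):
    # relative improvement going from each k to the next
    drops = [(inertias[i] - inertias[i + 1]) / inertias[i] for i in range(len(inertias) - 1)]
    ranked = sorted(range(len(drops)), key=lambda i: drops[i], reverse=True)
    return [ks[i + 1] for i in ranked[:top]]

# toy inertia curve with a clear elbow at k=4
print(elbow_candidates([2, 3, 4, 5, 6], [100.0, 60.0, 30.0, 28.0, 27.0]))  # [4, 3]
```

In practice I still eyeballed the plot, since the inertia curve for real text data is rarely this clean.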
It looks like 8 and 12 are good candidates for k. I decided on 8 to limit the number of clusters for easier interpretation.
n_clusters = 8
model = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=200, n_init=10, random_state=2021)
model.fit(X)
labels = model.labels_
Here I save the outcomes of clustering together with all the article information into a CSV file.
result = pd.DataFrame(
    list(zip(words_cn_joined, labels, origial_cn, source, date)),
    columns=['articles', 'labels', 'original_text', 'source', 'date']
)
# result.to_csv('cn_clustered.csv', index=False)
Now, let's visualize the clusters using word clouds and interpret them based on the most common words.
for k in range(n_clusters):
    s = result[result.labels == k]   # articles under the same label
    count = len(s)
    text = s['articles'].str.cat(sep=' ')
    wordcloud = WordCloud(max_font_size=50, max_words=100,
                          background_color="white", collocations=False).generate(text)
    print(f'Cluster: {k}, Number of articles: {count}')
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
Cluster: 0, Number of articles: 122
Cluster: 1, Number of articles: 190
Cluster: 2, Number of articles: 11
Cluster: 3, Number of articles: 18
Cluster: 4, Number of articles: 10
Cluster: 5, Number of articles: 45
Cluster: 6, Number of articles: 45
Cluster: 7, Number of articles: 35
My interpretation of the Chinese language clusters, after manually inspecting individual articles within each of them:
Cluster 0. "people, party, other" - This cluster includes many Chinese state media articles about the "people" and the "party", primarily domestically-focused stories. But it also seems to be a catch-all cluster for articles that do not fit anywhere else.
Cluster 1. "covid" - This cluster is about COVID-19. "New Crown" is the literal translation from "新冠", Chinese for novel coronavirus.
Cluster 2. "surveys" - This cluster, together with the "national youth" cluster, consists of a series of People.com articles about surveying young people and discovering their patriotism and love for their parents.
Cluster 3. "climate" - This is a climate change cluster.
Cluster 4. "national youth" - Same as the "surveys" cluster.
Cluster 5. "foreign relations" - This cluster has all the hard-core foreign affairs news. It's usually about a statement from the Chinese Foreign Ministry or the Russian Foreign Ministry. The Chinese state media often reference Russian support as they argue against US narratives.
Cluster 6. "economic growth" - This cluster is about economic growth.
Cluster 7. "beijing olympic" - This cluster is about the 2022 Winter Olympic Games in Beijing and how the city is planning to control COVID during the games.
Based on these clusters, we can see that about half of all Chinese media articles about the US were actually about Chinese domestic issues. It is worth noting that there is no US-specific cluster among the seven topical clusters. This is consistent with my hypothesis that international news in China is very "local": state media focus on topics that are mostly about China, such as economic growth and the Beijing Olympics.
But if these articles were mostly about China, why did they all mention the US? Upon close inspection, I realized that most of them use the US as a point of comparison to prove China's superiority. In other words, these US-related reports were usually not efforts to inform readers about America, but instead efforts to:
I will further develop these findings in later sections. But before that, let's first take a look at the English language articles.
origial_en = []
words_en = []
source = []
date = []
origial_en, words_en, source, date = generate_bow_en("data/cnn_20211004_20211012.json", origial_en, words_en, source, date, ['URL'], ['CNN'])
origial_en, words_en, source, date = generate_bow_en("data/cnn_20211013_20211020.json", origial_en, words_en, source, date, ['URL'], ['CNN'])
origial_en, words_en, source, date = generate_bow_en("data/cnn_20211021_20211028.json", origial_en, words_en, source, date, ['URL'], ['CNN'])
origial_en, words_en, source, date = generate_bow_en("data/nytimes_20211004_20211014.json", origial_en, words_en, source, date, ['The New York Times','URL'], ['The New York Times'])
origial_en, words_en, source, date = generate_bow_en("data/nytimes_20211015_20211025.json", origial_en, words_en, source, date, ['The New York Times','URL'], ['The New York Times'])
origial_en, words_en, source, date = generate_bow_en("data/nytimes_20211026_20211104.json", origial_en, words_en, source, date, ['The New York Times','URL'], ['The New York Times'])
origial_en, words_en, source, date = generate_bow_en("data/wsj_20211004_20211104.json", origial_en, words_en, source, date, ['story, link','<li>'], ['The Wall Street Journal'])
origial_en, words_en, source, date = generate_bow_en("data/foxnews_20211004_20211015.json", origial_en, words_en, source, date, [], [])
origial_en, words_en, source, date = generate_bow_en("data/foxnews_20211016_20211024.json", origial_en, words_en, source, date, [], [])
origial_en, words_en, source, date = generate_bow_en("data/foxnews_20211025_20211102.json", origial_en, words_en, source, date, [], [])
origial_en, words_en, source, date = generate_bow_en("data/foxnews_20211103_20211104.json", origial_en, words_en, source, date, [], [])
Again, I vectorize the articles using tf-idf and try to find an optimal k using the Elbow Method.
words_en_joined = [" ".join(article) for article in words_en]
vectorizer = TfidfVectorizer()
X_en = vectorizer.fit_transform(words_en_joined)
X_en
<826x4485 sparse matrix of type '<class 'numpy.float64'>' with 15357 stored elements in Compressed Sparse Row format>
Sum_of_squared_distances = []
K = range(5, 25)
for k in K:
    km = KMeans(n_clusters=k, max_iter=200, n_init=10, random_state=2021)
    km = km.fit(X_en)
    Sum_of_squared_distances.append(km.inertia_)

plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
15 and 21 are good candidates. Here I use 15 to limit the number of clusters for easier interpretation.
n_clusters_en = 15
model_en = KMeans(n_clusters=n_clusters_en, init='k-means++', max_iter=200, n_init=10, random_state=2021)
model_en.fit(X_en)
labels_en = model_en.labels_
Again, I save the clustering results into a CSV file for later analysis. Then I use word clouds to visualize the clusters.
result_en = pd.DataFrame(
    list(zip(words_en_joined, labels_en, origial_en, source, date)),
    columns=['articles', 'labels', 'original_text', 'source', 'date']
)
# result_en.to_csv('en_clustered.csv', index=False)
for k in range(n_clusters_en):
    s = result_en[result_en.labels == k]   # articles under the same label
    count = len(s)
    text = s['articles'].str.cat(sep=' ')
    print(f'Cluster: {k}, Number of articles: {count}')
    wordcloud = WordCloud(max_font_size=50, max_words=100,
                          background_color="white", collocations=False).generate(text)
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
Cluster: 0, Number of articles: 36
Cluster: 1, Number of articles: 282
Cluster: 2, Number of articles: 33
Cluster: 3, Number of articles: 20
Cluster: 4, Number of articles: 52
Cluster: 5, Number of articles: 13
Cluster: 6, Number of articles: 52
Cluster: 7, Number of articles: 24
Cluster: 8, Number of articles: 66
Cluster: 9, Number of articles: 66
Cluster: 10, Number of articles: 42
Cluster: 11, Number of articles: 36
Cluster: 12, Number of articles: 28
Cluster: 13, Number of articles: 49
Cluster: 14, Number of articles: 27
My interpretation of the English language clusters, after manually inspecting individual articles within each of them:
Cluster 0. "hypersonic missile test" - This cluster of articles talks about a hypersonic missile test that China allegedly conducted in August 2021. It was first reported by the Financial Times on October 17. China denied that it was a missile test, insisting that it was a spacecraft.
Cluster 1. "other" - This seems to be a catch-all cluster that includes all the articles that do not fit into other clusters.
Cluster 2. "taiwan" - This cluster is about Taiwan and the military threats it faces from mainland China.
Cluster 3. "fauci" - This is about Anthony Fauci and gain-of-function research. Upon further inspection, this is a cluster unique to Fox News.
Cluster 4. "hong kong" - This is a distinct cluster about Hong Kong and the national security law that took effect in mid-2020.
Cluster 5. "october, know" - This is a series of CNN articles with titles like "5 things to know for October 15" and "5 things to know for October 18".
Cluster 6. "covid" - Articles about COVID-19.
Cluster 7. "evergrande" - This is a cluster about Evergrande (恒大), the second-largest real estate developer in China, which recently ran into debt trouble and spurred fears of major defaults.
Cluster 8. "biden" - Articles about US President Joe Biden.
Cluster 9. "climate" - Articles about climate and COP26, the climate summit that took place at the end of October.
Cluster 10. "climate" - Articles about climate and energy issues. I will combine this cluster with the previous one and code them as one "climate" cluster.
Cluster 11. "tucker, ingrahm, hannity" - This is another unique Fox News cluster. These are all articles about Tucker Carlson, Laura Ingraham and Sean Hannity talking about China.
Cluster 12. "xi jinping" - Articles about Chinese President Xi Jinping.
Cluster 13. "company, economy, stock" - Articles about the economy and the stock market.
Cluster 14. "supply chain" - Articles about supply chain problems.
Different from Chinese state media, the American outlets seem to be less domestically-focused. There are many China-specific clusters, such as Taiwan, Hong Kong, Evergrande, and Xi Jinping. When a Chinese news article mentions the US, it's usually about China. When an English article mentions China, it is usually about a China-specific topic. However, we do have several overlaps with the Chinese language clusters, such as COVID, climate and the economy.
I also discovered some clusters unique to Fox News. It appears that only Fox News is talking about Anthony Fauci, gain-of-function research and their connections to China.
Next, I will break down the clusters by news sources (outlets) and analyze several interesting topics.
Here, I load back the results of clustering and re-label them with the "topics" that I interpreted based on common words.
result_en = pd.read_csv("en_clustered.csv", parse_dates=['date'])
result_cn = pd.read_csv("cn_clustered.csv", parse_dates=['date'])
result_cn.drop_duplicates(keep=False,inplace=True)
result_en.drop_duplicates(keep=False,inplace=True)
di_en = {0: "hypersonic missile test", 1: "other", 2: "taiwan", 3: "fauci", 4: "hong kong", 5: "october, know", 6: "covid", 7: "evergrande", 8: "biden", 9: "climate", 10: "climate", 11: "tucker, ingrahm, hannity", 12: "xi jinping", 13: "company, economy, stock", 14: "supply chain",}
result_en_labeled = result_en.replace({"labels": di_en})
result_en_labeled['date_only'] = pd.to_datetime(pd.to_datetime(result_en_labeled['date']).dt.date)
di_cn = {0: "people, party, other", 1: "covid", 2: "surveys", 3: "climate", 4: "national youth", 5: "foreign relations", 6: "economic growth", 7: "beijing olympic"}
result_cn_labeled = result_cn.replace({"labels": di_cn})
result_cn_labeled['date_only'] = pd.to_datetime(pd.to_datetime(result_cn_labeled['date']).dt.date)
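The replace call above is just a dictionary lookup applied to the labels column. A minimal self-contained example of the same pattern, with a made-up mapping:

```python
import pandas as pd

# toy cluster assignments and a hypothetical label-to-topic mapping
toy = pd.DataFrame({"labels": [0, 1, 1, 2]})
di = {0: "covid", 1: "climate", 2: "economy"}

# nested dict form restricts the replacement to the "labels" column
labeled = toy.replace({"labels": di})
print(labeled["labels"].tolist())  # ['covid', 'climate', 'climate', 'economy']
```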
Let's first look at our 476 Chinese language articles. As shown below, several topics stood out: COVID, the economy, foreign relations and the Beijing Olympics.
result_cn_labeled_count = (
    result_cn_labeled.groupby("labels")
    .agg(count=NamedAgg(column="labels", aggfunc="count"))
    .reset_index()
)
alt.Chart(result_cn_labeled_count).mark_bar().encode(
    x=alt.X('count'),
    y=alt.Y('labels', sort=alt.EncodingSortField(field="count", order='descending'))
).properties(
    title='Chinese articles - all categories'
)
alt.Chart(result_cn_labeled[result_cn_labeled['labels'] != 'people, party, other']).mark_bar().encode(
    x=alt.X('count(source)'),
    y='source',
    color='labels',
    order=alt.Order('labels', sort='ascending')  # sort the bar segments by label
).properties(
    title='Chinese articles by source by categories (except others)'
)
Huanqiu/Global Times (a nationalist semi-tabloid with a majority domestic audience) had the most coverage of the Beijing Olympics and of foreign relations news (statements from foreign ministries, news about diplomats, etc.). These were often Chinese, Russian, and other governments' criticisms of US policies; in other words, "feel-good" stories about how China is prevailing over the US diplomatically.
In contrast, People (THE authoritative state outlet) covers more domestic topics (especially economic growth), even when explicitly referencing the US. It also has a unique cluster: surveys. These are a series of articles referencing a survey conducted by the China Youth Daily, which suggested that young people in China are increasingly patriotic and respectful toward their parents (the two most promoted virtues of all). Since 2019, the Chinese propaganda departments have put an explicit emphasis on youth.
My sample of NYT Chinese language news was too small (only 6 articles) to draw conclusions from.
Now let's look at the distribution of topics proportion-wise.
alt.Chart(result_cn_labeled[result_cn_labeled['source'] != 'New York Times']).mark_bar().encode(
    x=alt.X('count(source)', stack="normalize"),
    y='source',
    color='labels',
    order=alt.Order('labels', sort='ascending')  # sort the bar segments by label
).properties(
    title='Chinese articles by source - proportion of categories'
)
Chinanews was formerly run by the Overseas Chinese Affairs Office, which was absorbed into the United Front Work Department of the Chinese Communist Party (CCP) in 2018. Its operations have traditionally been directed at overseas Chinese worldwide and residents of Hong Kong, Macau and Taiwan. We can see that Chinanews focused more on COVID-related news, as opposed to "feel-good" stories on foreign relations run by Huanqiu, or domestic topics covered by People. This pattern is exactly what I expected. Because Chinanews's main mission is to establish a "united front" with potentially sympathetic people abroad, its reporting focused on how China successfully controlled COVID while the US failed. It avoided many of the Huanqiu-style nationalist stories about foreign affairs as it tries not to come across as offensive to its American readers.
The WSJ Chinese language reports (62 in total) presented some interesting findings. Just like Chinanews, the WSJ's Chinese edition is trying to reach a foreign audience and give them alternative views on issues that concern both China and the US. And just as Chinanews avoided reports that might sound offensive to American readers, most of WSJ-CN's stories were not opinionated at all; they were usually matter-of-fact reports about international affairs (who did what) and the Chinese economy. However, upon further inspection, I noticed a difference between WSJ-CN's and People's coverage of the Chinese economy. WSJ-CN reported and speculated on government regulation and had extensive coverage of Evergrande (恒大), a major Chinese real estate firm in debt trouble. People had no coverage of Evergrande at all; instead, it cited broad statistics to argue that Chinese exports, financial markets, rural development, and the general economy were doing well.
Now let's break down the English language clusters by outlet. The biggest "topics" are climate, Biden, Hong Kong, COVID and the economy. Although I said previously that the American media are less domestically-oriented than the Chinese media, they are still quite domestically-oriented, as the "Biden" cluster shows.
result_en_labeled_count = (
    result_en_labeled.groupby("labels")
    .agg(count=NamedAgg(column="labels", aggfunc="count"))
    .reset_index()
)
alt.Chart(result_en_labeled_count).mark_bar().encode(
    x=alt.X('count'),
    y=alt.Y('labels', sort=alt.EncodingSortField(field="count", order='descending'))
).properties(
    title='English articles - all categories'
)
alt.Chart(result_en_labeled).mark_bar().encode(
    x=alt.X('count(source)', stack="normalize"),
    y='source',
    color='labels',
    order=alt.Order('labels', sort='ascending')  # sort the bar segments by label
).properties(
    title='English articles by source - proportion of categories'
)
As shown above, the Wall Street Journal, as expected, cares most about economic issues and financial markets: it has significantly more articles about company/economy/stock and about Evergrande, the Chinese real estate developer. It has no articles in the "Biden" cluster and the highest proportion of articles in the "Xi Jinping" cluster, showing that it is also the least domestically-focused outlet of the four.
Fox News seems to be the most domestically-focused outlet: when talking about China, it focused much of its discussion on Joe Biden.
Let's take a look at the distribution after removing the "other" cluster and these three outlier clusters:
alt.Chart(result_en_labeled[
    (result_en_labeled['labels'] != 'other') &
    (result_en_labeled['labels'] != 'october, know') &
    (result_en_labeled['labels'] != 'fauci') &
    (result_en_labeled['labels'] != 'tucker, ingrahm, hannity')
]).mark_bar().encode(
    x=alt.X('count(source)'),
    y='source',
    color='labels',
    order=alt.Order('labels', sort='ascending')  # sort the bar segments by label
).properties(
    title='English articles by source by select categories'
)
Next, I will take a closer look at three specific topics: the missile test, Taiwan and COVID.
China allegedly tested a hypersonic missile that circled the globe in August. The incident was first reported by the Financial Times on October 17. However, China denied that it was a missile test, insisting that it was a spacecraft.
All four American outlets in my sample reported on this subject, but only one of the three Chinese state media outlets, Huanqiu/Global Times, did. Below, I plot a timeline of reports related to this subject by each outlet.
result_en_labeled['missile'] = np.where( (result_en_labeled['articles'].str.contains('missile')) | (result_en_labeled['articles'].str.contains('hypersonic')) , True, False)
result_cn_labeled['missile'] = np.where((result_cn_labeled['original_text'].str.contains('超音速')) , True, False)
missile_cn = result_cn_labeled[result_cn_labeled['missile']==True]
missile_en = result_en_labeled[result_en_labeled['missile']==True]
missile_combined = pd.concat([missile_cn, missile_en])  # pd.concat replaces the deprecated DataFrame.append
alt.Chart(missile_combined).mark_bar().encode(
    x=alt.X('date_only:T', scale=alt.Scale(domain=["2021-10-04", "2021-11-04"])),
    y='count(date_only)',
    color='source',
    row=alt.Row('source:N')
).properties(
    height=80,
    width=400,
    title='Timeline: Hypersonic missile articles by source'
)
Fox News was the first to report on the alleged missile test, on the same day the FT broke the news; CNN reported the next day. Compared with the cable networks, the NYT and WSJ reported on it much later and had much less coverage in general.
Interestingly, only one Chinese state media outlet, Huanqiu/Global Times, reported on the test. In its article, Huanqiu repeated the Chinese government's claim that it was an ordinary spacecraft and enthusiastically scolded the American media for blowing it out of proportion. Huanqiu published its story only one day after the FT broke the news, showing that it was closely following the American reports and well-prepared to push back against them.
As a pattern, Huanqiu loves referencing foreign media sources, such as "美媒" (US media), "俄媒" (Russian media) and "外媒" (foreign media), either as evidence of China's superiority or to react against foreign reports with nationalist fury. Some examples:
result_cn_labeled[ result_cn_labeled['original_text'].str.contains('美媒') ]
| | articles | labels | original_text | source | date | date_only | missile |
|---|---|---|---|---|---|---|---|
| 126 | news medium president clinton admission new cr... | covid | 快讯!美媒:美国前总统克林顿入院,与感染新冠无关 - | Huanqiu.com | 2021-10-15 01:27:00+00:00 | 2021-10-15 | False |
| 128 | medium nuclear power submarine hit unknown obj... | foreign relations | 美媒:美国核动力潜艇在南海撞上不明物体,多人受伤 - | Huanqiu.com | 2021-10-07 23:22:00+00:00 | 2021-10-07 | False |
| 195 | changjin lake box office break 2 2 billion bri... | people, party, other | 《长津湖》票房已破22亿,英美媒体围观:超过同期好莱坞巨制! - | Huanqiu.com | 2021-10-05 06:17:00+00:00 | 2021-10-05 | False |
| 261 | medium million people apply beijing winter oly... | beijing olympic | 美媒:超百万人申请成为北京冬奥志愿者,凸显中国民众对政府疫情防控颇有信心! - | Huanqiu.com | 2021-11-03 23:07:00+00:00 | 2021-11-03 | False |
The English articles included a significant Taiwan cluster; the Chinese language articles did not.
Most interestingly, the Chinese version of WSJ seems to be much more careful on this topic than its English version. WSJ-CN published only two articles about Taiwan. Both of them addressed mainland China's concerns:
In contrast, the English edition of the WSJ published seven articles about Taiwan, including two corresponding to the Chinese language articles. But in general, the English articles were more alarmist about threats from mainland China, and most were written from the perspective of Taiwan and the US, such as:
This is another piece of evidence supporting my hypothesis that, just like Chinanews, the WSJ's Chinese edition is trying to win over a foreign audience and is careful about not offending them. It seems to be selectively publishing stories that are milder and less anti-China.
Now, let's look at the coverage of Taiwan across all outlets in both languages. Below, I create a subset of data that contains all articles that mentioned Taiwan.
result_en_labeled['taiwan'] = np.where(result_en_labeled['articles'].str.contains('taiwan'), True, False)
result_en_labeled_taiwan = result_en_labeled[result_en_labeled['taiwan']==True]
result_cn_labeled['taiwan'] = np.where(result_cn_labeled['original_text'].str.contains('台湾'), True, False)
result_cn_labeled_taiwan = result_cn_labeled[result_cn_labeled['taiwan']==True]
taiwan_combined = pd.concat([result_cn_labeled_taiwan, result_en_labeled_taiwan]).reset_index()  # pd.concat replaces the deprecated DataFrame.append
Here I first use dimensionality reduction to visualize the distribution of stories along their principal components, and then look for clusters of common words to help with interpretation. I chose TruncatedSVD over PCA for dimensionality reduction because it handles sparse matrices better: TruncatedSVD is essentially PCA without first centering the data, and centering a sparse matrix is impractical because it destroys sparsity.
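To make the "PCA without centering" point concrete, here is a small numpy sketch on a toy matrix of my own: the rank-2 projection that TruncatedSVD returns is (up to sign) just the data multiplied by the top-2 right singular vectors, with no mean subtraction anywhere:

```python
import numpy as np

# toy "document-term" matrix, 4 docs x 3 terms (fabricated counts)
A = np.array([[3., 0., 1.],
              [0., 2., 0.],
              [1., 0., 3.],
              [0., 1., 0.]])

# plain SVD of the uncentered matrix
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
projected = A @ Vt[:k].T   # what TruncatedSVD.fit_transform computes (up to sign)
same = U[:, :k] * S[:k]    # equivalent form: left singular vectors scaled by singular values

print(np.allclose(projected, same))  # True
```

Because no centering happens, A stays sparse-friendly throughout, which is exactly why TruncatedSVD is the standard choice for tf-idf matrices.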
vectorizer = TfidfVectorizer()
X_taiwan = vectorizer.fit_transform(taiwan_combined['articles'])
svd = TruncatedSVD(n_components=2, n_iter=15, random_state=2021)
data = svd.fit_transform(X_taiwan)
principalDf = pd.DataFrame(data=data, columns=['principal component 1', 'principal component 2'])
taiwan_combined = taiwan_combined.join(principalDf, how='outer')
alt.Chart(taiwan_combined).mark_point().encode(
    x=alt.X('principal component 1'),
    y=alt.Y('principal component 2'),
    color='source'
).properties(
    title='Results of dimensionality reduction - Taiwan articles'
)
At first glance, there is no obvious pattern tied to any particular news source, but there does seem to be a relationship between the two principal components: as we move in the positive direction of PC1, the variance along the PC2 axis increases. In the next chart, I combine the outlets into two categories: Chinese state media and American media.
taiwan_combined['Chinese state media'] = np.where( (taiwan_combined['source'].str.contains('Chinanews')) | (taiwan_combined['source'].str.contains('Huanqiu')), True, False)
alt.Chart(taiwan_combined).mark_point().encode(
    x=alt.X('principal component 1'),
    y=alt.Y('principal component 2'),
    color='Chinese state media'
).properties(
    title='Results of dimensionality reduction - Taiwan articles'
)
There were only six articles from Chinanews and Huanqiu. Note that this does NOT mean People did not publish any articles on Taiwan during the period; it just means that People did not explicitly mention the United States in its coverage of Taiwan. In the chart above, the Chinese state media stories cluster toward the negative direction of PC1 and stay roughly constant along PC2, while the American media articles diverge along PC2. Based on this, I expect to find two or three meaningful clusters.
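One way to sanity-check an expected cluster count before committing to a k is an elbow plot of k-means inertia. A minimal sketch on invented toy headlines (not the real corpus):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# invented toy headlines standing in for the Taiwan subset
docs = ['biden administration policy on taiwan arms',
        'biden officials discuss taiwan policy',
        'xi jinping statement on reunification with taiwan',
        'xi speech on taiwan reunification',
        'chinese state media on taiwan strait',
        'state media warn over taiwan strait']
X_toy = TfidfVectorizer().fit_transform(docs)

# inertia falls as k grows; a bend ("elbow") suggests a natural cluster count
inertias = [KMeans(n_clusters=k, n_init=10, random_state=2021).fit(X_toy).inertia_
            for k in range(1, 5)]
print(inertias)
```

On the real data I settled on k = 3 by combining this kind of check with the SVD scatter plot above.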
n_clusters_taiwan = 3
model_taiwan = KMeans(n_clusters=n_clusters_taiwan, init='k-means++', max_iter=200, n_init=10, random_state=2021)
model_taiwan.fit(X_taiwan)
labels_taiwan = model_taiwan.labels_
labels_taiwan
taiwan_combined['labels_taiwan'] = labels_taiwan
result_taiwan = taiwan_combined
for k in range(n_clusters_taiwan):
    s = result_taiwan[result_taiwan.labels_taiwan == k]  # pick the articles under the same label
    count = len(s)
    text = s['articles'].str.cat(sep=' ')
    print(f'Cluster: {k}, Number of articles: {count}')
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white", collocations=False).generate(text)
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
Cluster: 0, Number of articles: 17
Cluster: 1, Number of articles: 31
Cluster: 2, Number of articles: 11
As it turned out, almost all Chinese media articles belong to cluster 1, the biggest cluster that corresponds to the negative PC1 direction. Cluster 0 and cluster 2 represent the divergence of American media articles along PC2. Cluster 0 is mainly about the Biden administration's policy on Taiwan and cluster 2 is about Chinese President Xi Jinping's statements on reunification with Taiwan.
The most common topic between Chinese- and English-language articles was COVID. All seven outlets covered COVID extensively when reporting on the US and China. Does each outlet cover COVID the same way? Can we find a Chinese-American divide here as well?
First, I subset out all articles that mentioned "COVID", "coronavirus", or "新冠". Note that I removed two junk categories, 'october, know' and 'fauci', to make the TruncatedSVD results more meaningful.
result_en_labeled['covid'] = np.where( (result_en_labeled['articles'].str.contains('covid')) | (result_en_labeled['articles'].str.contains('coronavirus')), True, False)
result_en_labeled_covid = result_en_labeled[result_en_labeled['covid']==True]
result_cn_labeled['covid'] = np.where(result_cn_labeled['original_text'].str.contains('新冠'), True, False)
result_cn_labeled_covid = result_cn_labeled[result_cn_labeled['covid']==True]
covid_combined = pd.concat([result_cn_labeled_covid, result_en_labeled_covid]).reset_index(drop=True)
covid_combined = covid_combined[(covid_combined['labels']!='october, know')&(covid_combined['labels']!='fauci')].reset_index(drop = True)
Because I am now mixing English- and Chinese-language articles, I must be careful about artifacts of the translation process. Here, I remove all variants of "covid," "coronavirus," "new crown" (a literal translation of 新冠), and other translations of 新冠. This is an important step because vocabulary differences introduced by translation would otherwise severely distort the results of the dimensionality reduction and clustering algorithms.
# strip translation artifacts and shared COVID terms so they don't dominate the tf-idf space
for term in ['crown', 'covid', 'coronavirus', 'coronary', 'new', 'pneumonia', 'neoguan']:
    covid_combined['articles'] = covid_combined['articles'].str.replace(term, '', regex=False)
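To see why this matters, here is a toy illustration (both example headlines are invented, the second mimicking a machine translation with the "new crown" artifact): after the same `str.replace` pass, the artifact tokens disappear from the tf-idf vocabulary entirely, so they can no longer drive the decomposition:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# invented examples; the second row mimics a machine-translated Chinese headline
toy = pd.DataFrame({'articles': ['us covid cases rise before olympics',
                                 'china approves new crown vaccine booster']})
for term in ['covid', 'coronavirus', 'new', 'crown', 'pneumonia']:
    toy['articles'] = toy['articles'].str.replace(term, '', regex=False)

vocab = TfidfVectorizer().fit(toy['articles']).vocabulary_
print('crown' in vocab, 'covid' in vocab)  # False False
```

Without this step, "crown" would appear only in translated Chinese articles and "covid" mostly in American ones, so the clustering would pick up the translation, not the topic.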
alt.Chart(covid_combined).mark_bar().encode(
x=alt.X('count(source)'),
y=alt.Y('source')
).properties(
title='All COVID-related articles'
)
Now, I use TruncatedSVD for dimensionality reduction and plot the distribution along their two principal components.
vectorizer = TfidfVectorizer()
X_covid = vectorizer.fit_transform(covid_combined['articles'])
svd = TruncatedSVD(n_components=2, n_iter=15, random_state=2021)
data = svd.fit_transform(X_covid)
principalDf = pd.DataFrame(data=data, columns=['principal component 1', 'principal component 2'])
covid_combined = covid_combined.join(principalDf, how='outer')
As shown below, no significant cluster emerges for any particular source, except for Fox News articles, which tend to cluster toward the negative direction of PC1. There is also a larger divergence along the PC2 axis as we move toward the positive direction of PC1.
alt.Chart(covid_combined).mark_point().encode(
x=alt.X('principal component 1'),
y=alt.Y('principal component 2'),
color='source'
).properties(
title='Results of dimensionality reduction - COVID articles'
)
Again, I combine the outlets into two categories: Chinese state media and American media. Here we can see a rough pattern that distinguishes the two: Chinese state media articles tend to lie toward the positive direction of PC1, while there is no significant difference along the PC2 axis between Chinese and American outlets. This suggests we may find two meaningful clusters based on whether an article comes from a Chinese state media outlet.
covid_combined['Chinese state media'] = np.where( (covid_combined['source'].str.contains('Chinanews')) | (covid_combined['source'].str.contains('Huanqiu')), True, False)
alt.Chart(covid_combined).mark_point().encode(
x=alt.X('principal component 1'),
y=alt.Y('principal component 2'),
color='Chinese state media'
).properties(
title='Results of dimensionality reduction - COVID articles'
)
Again, I use k-means for clustering and create corresponding word clouds for interpretation.
n_clusters_covid = 2
model_covid = KMeans(n_clusters=n_clusters_covid, init='k-means++', max_iter=200, n_init=10, random_state=2021)
model_covid.fit(X_covid)
labels_covid = model_covid.labels_
labels_covid
covid_combined['labels_covid'] = labels_covid
result_covid = covid_combined
for k in range(n_clusters_covid):
    s = result_covid[result_covid.labels_covid == k]  # pick the articles under the same label
    count = len(s)
    text = s['articles'].str.cat(sep=' ')
    print(f'Cluster: {k}, Number of articles: {count}')
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white", collocations=False).generate(text)
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
Cluster: 0, Number of articles: 60
Cluster: 1, Number of articles: 52
We have two clusters: one centered on new COVID cases, one centered on vaccines. Let's plot out the distribution of these clusters among different outlets. Will there be a significant difference between Chinese state media and American media, as I predicted based on SVD?
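The cluster names here come from reading the word clouds; an alternative is to read the top-weighted terms directly off each k-means centroid. A minimal sketch on invented toy headlines (two about cases, two about vaccines):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# invented toy headlines: two about cases, two about vaccines
docs = ['new covid cases reported in beijing',
        'covid cases climb before olympics',
        'vaccine booster shots approved',
        'vaccine doses shipped abroad']
vec = TfidfVectorizer()
X_toy = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=2021).fit(X_toy)

# the highest-weight terms in each centroid name the cluster
terms = np.array(vec.get_feature_names_out())
for k, center in enumerate(km.cluster_centers_):
    print(k, list(terms[center.argsort()[::-1][:3]]))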
covid_combined['labels_covid'] = covid_combined['labels_covid'].map({0: 'cases', 1: 'vaccines'})
alt.Chart(covid_combined).mark_bar().encode(
x=alt.X('count(source)', stack="normalize"),
y=alt.Y('source', sort=alt.EncodingSortField(field="count(source)", order='descending')),
color='labels_covid'
).properties(
title='Proportion by source by cluster'
)
Yes! The American outlets focused overwhelmingly on COVID cases, while the Chinese outlets reported more on vaccines. Upon closer inspection, I found that the American outlets reported extensively on new cases in China and their implications for the Beijing Olympics. Again, this does not mean that Chinese media do not report on new domestic COVID cases; it just means that they do not often mention the US when reporting them.